Experiments in Data Compression
by John Kormylo
E.L.F. Software
Introduction
When first developed, LZA compression was superior to that
obtained using ZIP or LZH. Since then, both ZIP and LZH have
improved their compression efficiency dramatically. This report
documents attempts to improve LZA to regain its advantage. While
it is possible to out-perform ZIP consistently, the resulting
algorithm is glacially slow.
LZA Description
The LZA algorithm is a variant of LZSS combined with adaptive
arithmetic coding (see "The Data Compression Book" by Mark
Nelson). It differs from standard adaptive arithmetic coding by
using a "new code" character. It differs from LZSS by using
additional codes to transmit match tokens.
The "new code" character is followed by a (relatively)
uncompressed character which had previously been assigned a count
of zero. Adaptive arithmetic normally assigns a minimum count of
one to all 256 characters, which makes the coding inefficient
when only a few characters are used (e.g., text files). The
overhead for entering the character set using the "new code"
character is comparable to the overhead for transmitting count
information for a non-adaptive method.
The "new code" character always has a count of one (until
no new codes are left.) The new codes are entered by dividing
the current range (high - low) into 'n' segments of equal size,
where 'n' is the number of remaining characters with a count of
zero.
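As an illustration, coding a character through the "new code"
escape might look something like the following C sketch (the
names and structure are illustrative, not taken from the actual
LZA source):

    #define NSYMS 256

    unsigned long low, high;      /* current coder range          */
    unsigned long count[NSYMS];   /* adaptive counts, initially 0 */

    /* Narrow the range to the segment for unseen character 'c'. */
    void encode_new_char(int c)
    {
        unsigned long range = high - low;
        unsigned long n = 0, rank = 0;
        int i;

        /* n = characters still unseen; rank = position of 'c'
           among them. */
        for (i = 0; i < NSYMS; i++) {
            if (count[i] == 0) {
                if (i < c)
                    rank++;
                n++;
            }
        }
        /* Divide the range into n equal segments and pick
           segment number 'rank'. */
        high = low + range / n * (rank + 1);
        low  = low + range / n * rank;
        count[c] = 1;     /* the character is now in the model */
    }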
LZSS adds an extra bit before every character or match token
to indicate which is being used. LZA introduces an additional
character which is followed by the match token. If there are
exactly the same number of tokens as characters transmitted, then
this additional character will require only one bit to be coded
and every other character will require one additional bit to be
coded. For any ratio other than one to one, using this special
character is more efficient than using a single bit.
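The arithmetic behind this claim is easy to check: coded
adaptively, the escape costs about the binary entropy H(f) bits
per item, where f is the fraction of items that are match
tokens, while a flag bit always costs exactly one bit, and
H(f) < 1 everywhere except f = 0.5. The following small C
program (an illustrative check, not from the original report)
makes the comparison concrete:

    #include <math.h>
    #include <stdio.h>

    /* Binary entropy in bits: per-item cost of distinguishing
       tokens from characters with an ideal arithmetic coder. */
    double h(double f)
    {
        return -f * log2(f) - (1.0 - f) * log2(1.0 - f);
    }

    int main(void)
    {
        double f;
        for (f = 0.1; f < 1.0; f += 0.2)
            printf("token fraction %.1f: escape %.3f bits, "
                   "flag 1 bit\n", f, h(f));
        return 0;
    }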
Actually, LZA adds 64 additional characters corresponding to
64 possible match lengths. This takes advantage of the fact that
short matches are far more common than long matches by assigning
a count to each match length. Also, since matches longer than 32
bytes are rare, assigning an initial count of zero and using the
"new code" character turns out to be very effective.
Past locations in the message are organized into a self-
balancing binary search tree for faster matching.
LZA Improvements
Not all past locations should be included in the binary
search tree. Nelson mentions that, for speed, it is common not
to include the locations skipped over when a match is found. As
it turns out, the probability of matching at one of these
locations is so low that leaving them out of the search tree
improves the compression as well.
This is not the case, however, with the location matched.
While there is already another location in the search tree which
matches the first few characters, the probability of finding a
longer match using the new location is high enough to justify
its inclusion.
Another improvement can be made by keeping a count for how
often each location is used. Rather than dividing the current
range into segments of equal size for each location in the search
tree, the size of the segments can be made proportional to this
count (set to one initially). Also, when the tree is full, the
new LZA does not throw out old entries with counts greater than
one. Instead it decrements the count and allows it to cycle
through again (until the count goes to zero).
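In C, count-proportional index coding might be sketched as
follows (the Entry structure and names are illustrative
assumptions):

    typedef struct {
        long pos;          /* location in the message   */
        unsigned count;    /* times this entry was used */
    } Entry;

    /* Narrow the coder range to the segment for entry 'idx',
       with segment sizes proportional to the usage counts. */
    void encode_index(Entry *tree, int nentries, int idx,
                      unsigned long *low, unsigned long *high)
    {
        unsigned long range = *high - *low;
        unsigned long total = 0, below = 0;
        int i;

        for (i = 0; i < nentries; i++) {
            if (i < idx)
                below += tree[i].count;
            total += tree[i].count;
        }
        *high = *low + range / total * (below + tree[idx].count);
        *low  = *low + range / total * below;
        tree[idx].count++;    /* adapt to this use */
    }

The full loop over the tree is exactly the kind of summation
that the refinement described next avoids.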
For speed, LZA keeps a table of search tree indexes sorted
by count (in decreasing order), plus an array containing the
number of entries with counts 1 through 16. The sum of all
counts up to an entry with a count of one is given by the sum of
all counts (precomputed) minus the number of entries after the
index (times one). This is much faster than summing up to 16K
numbers.
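In code, the shortcut reduces to a constant-time lookup (an
illustrative sketch):

    /* Entries are sorted by count in decreasing order, so every
       entry after the first count == 1 entry also has count 1.
       The prefix sum up to such an entry is therefore the
       precomputed total minus one for each entry after it. */
    unsigned long cum_count(unsigned long total, /* sum of counts */
                            int nentries,        /* table size    */
                            int idx)             /* count == 1    */
    {
        return total - (unsigned long)(nentries - 1 - idx);
    }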
Also, if a location is used more than once, odds are that it
will use the same length as before. Therefore, one of the length
codes means "use the same length as last time." (It replaced the
length = 2 code, which was never used in practice.) Since a
location must have been used before for this length to be
defined, only those search tree entries with counts greater than
one are used when computing the index code.
While searching through the binary tree will easily locate
the longest match, it will not necessarily locate the best match
of that length (the one with the highest count). Any match with
the same length as the one found will be in the sub-tree starting
with the entry found, but one must search both branches to find
all of them. This can be done using a recursive function to
search both branches whenever a duplicate match is found.
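Such a walk might be sketched in C as follows (match_len is an
assumed helper returning the common prefix length of the message
at two positions):

    typedef struct Node {
        long pos;                  /* candidate match location */
        unsigned count;            /* usage count              */
        struct Node *left, *right;
    } Node;

    extern int match_len(long a, long b);   /* assumed helper */

    /* Starting from the node where the longest match was found,
       visit both branches whenever a duplicate match appears and
       keep the equal-length match with the highest count. */
    void best_of_length(Node *n, long here, int len, Node **best)
    {
        if (n == NULL)
            return;
        if (match_len(n->pos, here) >= len) {
            if (*best == NULL || n->count > (*best)->count)
                *best = n;
            best_of_length(n->left, here, len, best);
            best_of_length(n->right, here, len, best);
        }
    }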
Tree Size Experiments
Text files generally do best with a very large tree (16K
entries or more). Programs do better with smaller trees. A
sorted text file (actually, a dictionary) worked best with a 16
entry search tree.
The following table shows the compressed file size for three
test datasets as a function of tree size. For comparison the
corresponding ZIP compressed file sizes are included, and a '>'
marks the best single-tree result in each column. (Two of these
files can be downloaded from Genie, and one is from the Atari
developer's package.)
One way to improve performance over a wide range of tree
sizes is to use more than one tree at a time and include a code
to select which tree to use. Best results were obtained using 3
trees: sizes 16, 512 & 16K. These results are included in the
following table (as "3 trees").
Actually, there are many ways to do this. One way would be
to use a different group of length codes for each tree. Another
would be to use regular codes to indicate the tree selected and a
separate code set (like the index code) for the length indicator.
However, best results were obtained using the same length codes
as before and a separate set of codes for the tree selection
indicator.
tree | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
16 | 45055 | 46100 | > 476285 |
32 | 44186 | 43144 | 488860 |
64 | 40211 | 41018 | 506394 |
128 | 37345 | 38961 | 521958 |
256 | 34163 | 36954 | 539121 |
512 | 31198 | 35443 | 559172 |
1024 | 28756 | 34483 | 582964 |
2048 | 27039 | 33763 | 612184 |
4096 | 25765 | > 33673 | 639819 |
8192 | 25114 | 33822 | 658753 |
16384 | > 24942 | 34003 | 670774 |
-------+--------------+-------------+--------------+
ZIP | 25207 | 32444 | 496880 |
-------+--------------+-------------+--------------+
3 trees| 24928 | 33073 | 559810 |
avg | 24638 | 33139 | 556216 |
-------+--------------+-------------+--------------+
gaps | 28202 | 33867 | 471766 |
trim | 24910 | 33163 | 531371 |
best | 22828 | 32191 | 457176 |
It is interesting to note that this method out-performed the
best single tree size for two of the three files. Also, the
resulting counts showed that all three trees were used
frequently, and that the smaller trees were used far more often
than would be indicated by their relative sizes.
Up to this point, comparisons between matches of different
lengths were made by adding the cost of the missing characters to
the shorter of the two matches. However, it is possible that
more than one short match could replace a long match. To
overcome this problem, one can compare the average cost per
character for the two matches. Specifically, this is computed
using
avg = log(4294967296.0 / size) / len
where 'len' is the number of characters and 'size' is the
segment size (starting from 4294967296) that results from
arithmetically coding the length, tree, and index codes.
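Restated as a small C helper (not code from LZA itself):

    #include <math.h>

    /* Average coded cost per character covered by a match.
       'size' is the segment size left after coding the length,
       tree, and index codes, out of a full range of 2^32; the
       log base only matters as a common scale factor. */
    double avg_cost(double size, int len)
    {
        return log(4294967296.0 / size) / len;
    }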
This not only works well for selecting which tree to use,
but also whether to use a match at all (by comparing the average
cost per character to the cost of the first character). The
results are shown in the first table (as 'avg').
The most surprising thing is that this approach performed so
poorly on WORDLIST.TXT relative to how it performed using only
the 16 entry tree. A test was run which used the 16 entry tree
wherever possible, and the other trees only when no match was
found. The resulting compressed dataset was 556869 bytes long.
It was also discovered that the 'avg' method used the 16
entry tree 93278 times and the "use 16 if possible" test used it
99475 times, yet the 16 entry tree running alone found 164708
matches. The extra matches must have come from INSIDE the matches
found in the 512 and 16K entry trees. Obviously, no "greedy"
algorithm was going to find those matches.
Another test was run which only looked for matches from the
large search trees to fill in the gaps between matches with the
small trees. The results are shown in the first table (as
'gaps'). While this worked well for WORDLIST.TXT, it performed
very poorly on DEMONSTR.DOC.
A similar test was run which used shorter matches when the
longest match found extended past the start of a match from the
16 entry tree. The results are shown in the first table (as
'trim'). This performed well on all test files.
While these tests were not good enough, they did show that
using three trees could give acceptable results if the proper
matches were used. However, to find these matches it would be
necessary to test all possible ways to compress the data. The
algorithm tried would check all possible combinations of matches
and characters until a unique first step was found, 32 bits of
compressed output were produced, or 256 bytes of input data were
compressed (whichever came first). After the first step was
implemented, the process was started all over again since the
counts and search trees had been modified and the old analysis
was no longer valid. The results are shown in the first table
(as 'best'). The only problem with this method is that it is
VERY slow.
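The stopping rule for this full search might be sketched in C as
follows (Choice and next_choices are assumed helpers; costs are
measured in output bits):

    typedef struct {
        int len;        /* input bytes this choice consumes */
        double cost;    /* output bits this choice adds     */
    } Choice;

    #define MAX_CHOICES 64

    /* Assumed helper: fill 'out' with every match available at
       'pos' plus the single-character choice; returns how many. */
    extern int next_choices(long pos, Choice *out);

    /* Recursively cost every combination of choices starting at
       'pos', cutting a branch off at 32 bits of output or at the
       input limit (256 bytes past the starting point). */
    double search_cost(long pos, long limit, double bits)
    {
        Choice ch[MAX_CHOICES];
        double best = 1e30, c;
        int i, n;

        if (bits >= 32.0 || pos >= limit)
            return bits;
        n = next_choices(pos, ch);
        for (i = 0; i < n; i++) {
            c = search_cost(pos + ch[i].len, limit,
                            bits + ch[i].cost);
            if (c < best)
                best = c;
        }
        return best;
    }

The caller would keep only the first step of the cheapest path,
emit it, and (as described above) restart the whole analysis,
since the counts and search trees change after every step.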
Finding Contiguous Matches
In this section we will attempt to equal the 'best'
algorithm using another approach.
If a match can be found which extends past the end of a
larger match, but no match can be found at the end of the larger
match, then the location inside the smaller match corresponding
to the end of the larger match must have been skipped. Therefore
by including the skipped locations in (some) of the search trees,
one should be able to locate more contiguous matches.
The following table shows the compressed file sizes using a
single search tree which includes all locations (no skipping).
These do not perform nearly as well as the corresponding trees in
the first table.
If we combine the 3 trees as before, the results are
surprisingly respectable. The '0,3' values use no skipping in
all three trees (16, 512, & 16K). The '1,2' values use skipping
in the 16K entry tree but not in the other two. The '2,1' values
use skipping in all but the 16 entry tree.
tree | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
16 | 45334 | 47560 | 623420 |
512 | 39661 | 40590 | 620509 |
16384 | 34762 | 37505 | 696476 |
-------+--------------+-------------+--------------+
0,3 | 25401 | 32829 | 516093 |
1,2 | 24570 | 33000 | 514302 |
2,1 | 24641 | 33163 | 511917 |
-------+--------------+-------------+--------------+
3,1 | 24612 | 33094 | 507442 |
3,2 | 24457 | 32843 | 504815 |
-------+--------------+-------------+--------------+
mixed | 24261 | 32854 | 503976 |
Even better results can be obtained using additional trees
which contain only the (previously) skipped locations. The '3,1'
values use the previous 3 trees (16, 512 & 16K) plus an
additional 16 entry tree which contains ONLY skipped locations.
The '3,2' values use the same plus an additional 512 entry tree
which also contains ONLY skipped locations.
The best results for this approach were obtained using a
'3,2' tree combination as before, but now the first 16 entry tree
contains ONLY the matched locations while the second 16 entry
tree contains ALL locations. These results are shown in the
above table (as 'mixed').
Biting the Bullet
While the results in the second table are interesting, the only way
to consistently out-perform ZIP is to use the full search
('best') algorithm. In this section we will be concerned with
improving this approach.
There are several reasons why this approach is more time
consuming. First, it must search for a match at every location.
Second, it must search for the best match for each possible
length. The time required to perform this search is roughly
proportional to how many matches are found. Finally, it must
repeat some of these searches, since the results are thrown away
once the next code is determined.
In the original algorithm, the limiting factor was almost
always the 32 bits of output. In fact, one can reduce the
maximum number of input bytes to 65 with no penalty at all.
Further reduction requires also reducing the maximum match
length. The results for various length parameters are shown in
the next table. However, the resulting increase in speed was not
substantial.
length | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
65 | 22828 | 32191 | 457176 |
33 | 22928 | 32261 | 457174 |
17 | 23794 | 32726 | 455700 |
9 | 26552 | 34073 | 484471 |
On the other hand, by increasing the number of output bits
(actually, by computing segment sizes with floating point
numbers instead of unsigned longs) one can improve the
compression, as shown in the next table. At least the floating
point operations do not significantly slow it down any further.
The limiting factor now becomes the number of tests needed to
determine a unique first step. This is usually only slightly
larger than the initial match length.
length | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT |
-------+--------------+-------------+--------------+
65 | 22196 | 32035 | 455017 |
33 | 22329 | 32137 | 455261 |
-------+--------------+-------------+--------------+
len | 22522 | 32054 | 461290 |
first | 22595 | 32067 | 479614 |
save | 22170 | 32037 | 456498 |
One way to reduce the number of match searches is to assume
that a match taken from a smaller tree is always preferable to
the same length match taken from a larger tree. This turns out
not to be the case, however, as shown in the above table (as
'len'). Nor does it run much faster.
Alternatively, one could eschew the search for the best
match, taking the first match found. Again, there is little time
saved at the cost of compression efficiency. The results are
shown in the above table (as 'first').
Another possible way to improve speed would be to save the
results after each step (as opposed to starting all over again).
This would mean far fewer searches. However, it would require
performing all of the search tree updates and sorts and saving
enough information to undo them. Because of all the additional
tree and sort operations, this approach is only slightly faster
than the 'best' algorithm. The good news is that the performance
is about the same, as shown in the above table (as 'save').
Since LZW is inherently faster than LZSS, attempts were made
to use LZW variants together with the full search. However, not
only were the results not even in the same ballpark, but the
speed still left a lot to be desired.